I. INTRODUCTION


A.  BACKGROUND 

Email has become the standard for fast, inexpensive, and easily accessible
communication. The explosive growth of email affects everyone in the Department
of Defense as well as in the civilian work environment. Its largest impact is on
management and record-keeping personnel. Typical military users receive between
30 and 70 emails a day, depending on their rank and billet (Marsan, 2002).
Attachments such as spreadsheets, documents, presentations, graphics, and
executable programs make email even more valuable. Because of its ease of use,
email has become an integral part of the daily operations of military
organizations. The information in unclassified and classified emails is also a
treasure trove of operational data.

B.  THESIS  OBJECTIVE 

This thesis seeks to improve the organization of individual users' email by
implementing an automated categorizer for email. The author seeks to eliminate
the large amount of manual email categorization that many users currently do.
This could be useful to military personnel for reasons of efficiency, privacy,
and high turnover. Turnover of military personnel happens every 2-3 years, and
old email is frequently deleted rather than organized and used to document
valuable operational processes and data. Within the DoD there are generally no
formal filing or retention policies for email. There are guidelines such as DoD
Directive 5012.2, but these deal with large-scale records management.

Old email messages are required for numerous purposes. These include day-to-day
business operations and requests for historical financial information,
activities, logistics, etc. When each individual decides which email messages to
retain or delete, much information is hidden from the rest of the organization.
Email stored this way is scattered across personal archives. Although servers
can store these messages, the number of messages is constantly increasing. If
backup tapes are not kept for retrieving old messages, and an individual goes on
leave or is otherwise unavailable, critical information in his email is
unavailable as well. This suggests organizational-level archiving and
categorizing, but what can be done at the user level to help organize this
information?

With the average user manually archiving an estimated 300 megabytes (MB) of
email a year, categorizing methods can vary dramatically in their effectiveness
and overall organization (Ferris, 1999). The primary goal of this research was
to survey existing methods and determine a method or a combination of methods
that would work well for email categorization. A secondary goal was to create a
tool to accurately and quickly categorize and archive email messages at the
local user level.

This  thesis  will  attempt  to  answer  the  following  questions: 

•  Can machine-learning programs accurately categorize email?

•  What  are  the  strengths  and  weaknesses  of  automated  categorizers? 

C.  OUTLINE  OF  THESIS 

Chapter  II  describes  previous  attempts  at  automated  categorizing  and  other  similar 
problems  in  text  retrieval.  Chapter  III  gives  a  description  of  the  structure  and 
components  of  the  program.  A  detailed  description  of  the  data  and  corpus  is  also  given. 
Chapter  IV  provides  a  description  of  a  categorization  program  that  we  developed. 
Chapter  V  discusses  the  program’s  performance  and  the  accuracy  of  its  results.  Chapter 
VI  reviews  the  program’s  achievements  and  major  weaknesses. 
II. OTHER WORK ON TEXT CATEGORIZATION


A.  INTRODUCTION 

Text categorization has become a very active research topic over the last few
years. Many of the approaches seek to categorize documents on the Internet. In
this thesis the document is an email, and specifically the text within the
email. Categories can be summarized using phrases, words, or numbers.
Traditionally, a domain expert, usually a librarian, does text categorization
manually: the expert reads each document and places it in the appropriate
category. To eliminate the large amount of manual effort required, we could use
automatic categorization that learns from training examples. The classic
approach is to assign weights to particular words in particular categories; the
inferred category of a document is the one with the highest weighted sum
(Witten, Frank, 2000).

Two  categorization  techniques  used  are  instance-based  learning  and  Naive-Bayes 
probabilistic  classification.  Instance-based  learning  methods  begin  with  a  particular 
example  and  generalize  it  to  cover  other  similar  examples  in  the  same  category.  The 
Naive-Bayes  approach  uses  the  conditional  probabilities  of  categories  given  a  word  to 
estimate  the  probabilities  of  categories  given  an  email  document;  this  model  assumes 
word  independence.  Typically  a  list  of  “stop  words”  to  be  ignored  and  some  sort  of 
destemming  algorithm  are  used  to  help  normalize  the  word  list. 
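
As an illustration of the Naive-Bayes idea described above, the following Python
sketch trains on a few labeled word lists and picks the most probable category.
It is not the program developed in this thesis. It uses the common multinomial
formulation (per-category word probabilities combined through Bayes' rule under
the word-independence assumption), and the function names, the add-one
smoothing, and the toy training data are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (category, list_of_words) pairs."""
    doc_counts = Counter()              # number of training documents per category
    word_counts = defaultdict(Counter)  # word occurrences per category
    vocabulary = set()
    for category, words in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(words, model):
    """Return the category with the highest log posterior for the word list."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in words:
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing is an assumption of this sketch.
            p = (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)  # independence assumption: log-probabilities add
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training data (hypothetical, not from the thesis corpus).
model = train_naive_bayes([
    ("fiscal", ["payment", "money", "due"]),
    ("personal", ["trip", "love", "thanks"]),
])
print(classify(["money", "payment"], model))
```

Working in log space avoids numeric underflow when many small word
probabilities are multiplied together.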

B.  LITERATURE  REVIEW 

Lam, Ruiz, and Srinivasan investigated whether automatic categorization yields
better retrieval performance than manual categorization when applied to medical
documents (Lam, Ruiz, Srinivasan 1999). They analyzed retrieval performance on
test queries to gain insight into the interaction between their categorizer and
text retrieval. The first part of their work dealt with automatic
categorization, including a category-extraction process. For their test
documents they used a corpus of medical documents from the MEDLINE database that
is referred to as the HERSH corpus.

The authors ran a series of experiments on parameter selection to provide a
metric and categorization results. Their results are broken down into category
and document perspectives. The category perspective results are related to sizes
of categories ranging from 10 to 60 categories. Three different category sets
were tested: C0, C35, and C50. C0 used all manually assigned categories that
existed in the training set and test set. C35 and C50 limit the categories to
those that have a document frequency greater than 35 or 50, respectively. The
document frequency is the number of documents to which a specific category is
assigned. The F1 score is an equally weighted combination (the harmonic mean) of
recall and precision, with the per-category scores averaged to give a mean.
Their results for parameter selection can be seen in Table 1.


                                      Parameter selection
                                      based on training set
Run    # of categories    F1 score    N       M
C0     641                0.258       5       50
C35    58                 0.468       5       40
C50    43                 0.509       30      20

Table 1. Results for Parameter Selection in (Lam, Ruiz, Srinivasan 1999).

The results indicate that as the frequency threshold on the category set
increases, the mean F1 score improves. N represents the number of documents,
while M is the number of categories.
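
For reference, the F1 measure used in Table 1 can be computed as in the short
Python sketch below. The formula is the standard harmonic mean of precision and
recall; the function names and the choice to return 0.0 when both inputs are
zero are assumptions made for this example, not details taken from Lam, Ruiz,
and Srinivasan.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1(precision_recall_pairs):
    """Average the per-category F1 scores to obtain a mean F1."""
    scores = [f1_score(p, r) for p, r in precision_recall_pairs]
    return sum(scores) / len(scores)


# Two hypothetical categories: one perfect, one mediocre.
print(mean_f1([(1.0, 1.0), (0.5, 0.5)]))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a
category scores well only when precision and recall are both high.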

Yang did a comparative evaluation of statistical approaches to text
categorization (Yang, 1998). The author used several versions of the Reuters
newswire corpus of 20,000 documents to evaluate the categorization methods of
k-nearest neighbors (kNN), simple word matching, decision trees, Naive-Bayes,
inductive-rule learning in disjunctive normal form, neural networks, Rocchio,
linear least-squares fit (LLSF), and "sleeping experts". The author found that
linear least-squares fit performed best.

In addition to experimenting with thresholding techniques, Yang concluded that
variability in classifier performance across collections is common. Although
the Word approach, which looks at single-word frequency, performed better when
changing from a labeled to an unlabeled corpus, it was still outperformed by
other methods such as kNN and LLSF.




Moens and Dumortier applied text categorization to magazine articles to study
the effects of the selection of feature words and proper names (Moens,
Dumortier, 2000). The authors use a standard approach of stop-word removal and
then select keywords by applying statistical weights to the words that remain.
For proper names, capitalized words are given a heavier weighting. Terms with a
calculated weight above 0.4 conditional probability are selected. The authors
also apply the technique of "zoning", which is the selection of word examples
that are in close proximity to other word examples within the document. Moens
and Dumortier compare a Bayesian independence classifier to the Rocchio
algorithm and a χ² (chi-square) algorithm. The χ² algorithm tests how closely a
set of observed frequencies corresponds to a set of expected frequencies. The
observed frequencies are the numbers of texts, relevant or non-relevant for the
text category, that contain the feature word. The authors conclude that the χ²
algorithm worked best, with a recall of 0.73 and precision of 0.64, versus the
Bayesian method's recall of 0.58 and precision of 0.61 and the Rocchio
algorithm's recall of 0.64 and precision of 0.57.
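
To make the comparison of observed and expected frequencies concrete, the
following Python sketch computes the standard Pearson chi-square statistic for
one feature word and one category from a 2x2 table of document counts. The
parameter names are assumptions for this example, and the sketch shows the
general statistic rather than the exact variant used by Moens and Dumortier.

```python
def chi_square(rel_with, nonrel_with, rel_without, nonrel_without):
    """Pearson chi-square for a 2x2 table of document counts.

    rel_with       -- relevant texts that contain the feature word
    nonrel_with    -- non-relevant texts that contain the feature word
    rel_without    -- relevant texts that do not contain the word
    nonrel_without -- non-relevant texts that do not contain the word
    """
    a, b, c, d = rel_with, nonrel_with, rel_without, nonrel_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the word carries no signal
    return n * (a * d - b * c) ** 2 / denominator


# A word spread evenly over relevant and non-relevant texts scores 0;
# a word that appears only in relevant texts scores high.
print(chi_square(10, 10, 10, 10))
print(chi_square(20, 0, 0, 20))
```

A larger value means the word's distribution departs further from what
independence between word and category would predict, which makes it a stronger
candidate feature.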

Salton and Buckley propose a related method that measures how similar a
document is to a query document by statistically weighting terms within the
document (Salton, Buckley, 1988). The authors compare the results of eight
different term-weighting methods on different collections of documents. They
make recommendations for query and document vectors concerning the
term-frequency component, the collection-frequency component, and the
normalization component. The authors conclude that for short queries each term
is important and query-term weights are preferred. For document vectors, the
authors conclude that for technical vocabulary an enhanced frequency-weighting
scheme should be used, which automatically places term weights between 0.5 and
1.0. Our application involves short technical documents and can use this
approach by using individual term weighting rather than a similarity
calculation.
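
One common form of such an enhanced (augmented) term-frequency weight is
0.5 + 0.5 * tf / max_tf, where tf is the number of times a term occurs in the
document and max_tf is the count of the most frequent term in that document.
The Python sketch below illustrates it; the function name and the sample words
are assumptions for the example, and the collection-frequency and normalization
components discussed by Salton and Buckley are left out.

```python
from collections import Counter


def augmented_term_weights(words):
    """Map each term in a document to 0.5 + 0.5 * tf / max_tf."""
    counts = Counter(words)
    if not counts:
        return {}
    max_tf = max(counts.values())
    # Every term that occurs at least once lands in the range (0.5, 1.0].
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}


print(augmented_term_weights(["parts", "parts", "repair"]))
```

The most frequent term always receives weight 1.0, and rare terms stay above
0.5, so a long document does not swamp a short one simply by repeating words.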
III. DESCRIPTION OF APPLICATION


In this chapter we describe our application and present an algorithm for
categorizing email documents using a probabilistic model. The algorithm uses
count data, that is, the frequency of the terms in a document. Our approach
relies on keyword clues. A training process learns from pre-categorized examples
how to identify categories for new documents. The categorization technique used
in the algorithm is a linear numeric prediction model (Witten, Frank, 2000).

A.  DATA 

A total of 737 emails were used to train and test the categorizer. Table 2
provides the data characteristics of the emails. Some emails had previously been
saved with an ".html" extension and others with a ".txt" extension. HTML tags
were identified and included in the stop-word list. The emails came from the
author's personal work archives. Approximately 20% pertain to the author's
experience as a Supply Officer; the remaining 80% were collected during the
author's experience as a graduate student.


Characteristic                                              Value
Number of emails                                            737
Total number of words in text corpus                        31,593
Total number of words after destemming and stopword
  removal                                                   20,115
Total number of unique words after destemming, stopword
  removal and extraction of HTML tags and other special
  characters                                                ~12,000
Number of words per email document                          45 - 1082

Table 2. Training and Testing Data Characteristics.


Table 3 shows the specific category descriptions and example relevant words
which the author thinks the program should choose as keyword clues relating to a
category. The author identified these categories by grouping the emails
logically according to their content. Except for categories #4, #5, #10, #15,
and #17, the categories could be applied to other military-service emails as
well as civilian-business emails.


Category #1: Classes
    Description: Emails with administrative course material.
    Possible key words: class, info, course info, homework, homework problems,
        answers

Category #2: Grades
    Description: Emails with information about grades and transcripts.
    Possible key words: grades, registrar, python, final grade, homework grade,
        test grade

Category #3: Personal
    Description: Emails with information received from the author's family
        members and other matters he deemed not directly connected to his work
        environment.
    Possible key words: trip, Michelle, Abby, love, us, me, thanks, dear

Category #4: 3270
    Description: Emails dealing with connectivity problems to a mainframe
        computer using 3270 emulation software.
    Possible key words: 3270, ACID, password, connectivity, lack, mainframe,
        service
        (ACID - access control identification)

Category #5: BWD Mess
    Description: Emails that involved the author's job as the Marine Officer in
        charge of collecting wardroom dues onboard the USS Belleauwood from
        11/98 to 6/99.
    Possible key words: money, payment, dues, mess, receipt, check

Category #6: Equipment
    Description: Emails that dealt with equipment issues.
    Possible key words: swap, equipment, truck, weapons, parts, lead-time, fix,
        gear

Category #7: Equipment Allowance
    Description: Emails that deal with equipment allowances and what a unit was
        reporting to have.
    Possible key words: OH, on-hand, own, equipment, T/O, temp, shortage,
        overage
        (OH and T/O stand for on hand and table of organization, which is the
        structuring of a unit)

Category #8: Equipment Readiness
    Description: Emails that pertain to the physical condition of a piece of
        equipment and whether it was working properly or not.
    Possible key words: maintenance, parts, deadline, fix, repair, leadtime,
        running, MIMMS
        (MIMMS - Marine Corps Integrated Maintenance Mgt. System)

Category #9: Exercises
    Description: Emails that pertain to military exercises the author partook
        in and various problems that he resolved or worked on.
    Possible key words: Cobra-Gold, billeting, exercise, funding, travel,
        planning, meeting, Y2K

Category #10: Expeditor
    Description: Emails that pertain to a person whose job involved the explicit
        tracking of equipment through the transportation pipeline.
    Possible key words: tracking, equipment, package, arrival, Carl, time, where

Category #11: Fiscal
    Description: Emails concerning payment and disbursal matters, contracts, and
        equipment receipts.
    Possible key words: payment, due, money, SABRS, JON, financial, authority
        (SABRS - standard accounting and budgeting requirement system;
        JON - job order number)

Category #12: General
    Description: Emails concerning general administrative purposes.
    Possible key words: administrative, requirement, meeting, turn-in, due

Category #13: Maintenance
    Description: Emails that pertain to the physical condition of a piece of
        equipment, whether it was working properly or not, and the parts status
        for a piece of equipment.
    Possible key words: maintenance, parts, deadline, fix, repair, lead-time,
        running, MIMMS, status, up
        (MIMMS - Marine Corps Integrated Maintenance Mgt. System)

Category #14: Miscellaneous
    Description: Emails that were of mixed purposes. They are primarily
        differentiated from category #12 by the variety within each email,
        whereas category #12 had a general administrative theme.
    Possible key words: odd, here, fun, get, read, keep, future, misc

Category #15: NBC
    Description: Emails that involved Nuclear, Biological or Chemical (NBC)
        supply issues.
    Possible key words: mask, parts, atropine, injector, NBC, filter, gas,
        chamber

Category #16: Records
    Description: Emails concerning supply record administration issues.
    Possible key words: CMR, on-hand, drop, add, quantity, description, account,
        inventory, count
        (CMR - consolidated memorandum receipt)

Category #17: Requests
    Description: Emails concerning supply requests for equipment or the
        purchasing of administration supplies.
    Possible key words: computer, request, get, date, buy, purchase, money,
        contract

Category #18: Shipboard billeting
    Description: Emails concerning issues involving billeting or berthing for
        Marine Officers on board the USS Belleauwood. Most emails revolve
        around room assignments.
    Possible key words: room, assignment, billeting, berthing, Belleauwood,
        officer, assignment

Category #19: Tech info
    Description: Emails that include technical issues focused around the area of
        computer science. This area was differentiated from category #1 by its
        lack of specificity in many cases.
    Possible key words: networking, computer, artificial, intelligence,
        software, web, Internet

Table 3. Category Descriptions and Possible Keywords Identifying These Categories.


The exact number of texts used for each category in the training and test sets
is given in Table 4 below. The training and test sets were formed by placing 80%
of each category into the training set and the remaining 20% into the test set.
A constraint was the limited number of examples in some of the categories. As
with much text categorization, a central problem is the lack of standard data
collections.

Category                 Number of email texts    Number of email texts
                         for the training set     for the test set
Shipboard billeting      13                       4
Requests                 20                       5
Records                  2                        0
NBC                      7                        2
Miscellaneous            17                       4
Maintenance              4                        2
General                  4                        1
Fiscal                   20                       4
Expeditor                5                        2
Exercises                6                        2
Equipment Readiness      11                       3
Equipment Allowances     8                        3
Equipment                11                       2
BWD Mess                 3                        0
3270                     9                        3
Personal                 242                      61
Grades                   22                       6
Classes                  104                      26
Tech Info                88                       22
Total
(Total Emails: 737)      585                      152

Table 4. Number of Texts Per Category in Training and Test Sets.
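
The per-category 80/20 division described above can be sketched in Python as
follows. This is a minimal illustration, not the procedure actually used to
build Table 4: the function name, the random shuffling with a fixed seed, and
the rounding down of the training share are all assumptions, and the thesis
does not state how very small categories were rounded.

```python
import random


def split_by_category(docs_by_category, train_percent=80, seed=0):
    """Split each category's documents into training and test portions."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = {}, {}
    for category, docs in docs_by_category.items():
        shuffled = list(docs)
        rng.shuffle(shuffled)
        # Integer arithmetic: round the training share down (an assumption).
        n_train = len(shuffled) * train_percent // 100
        train[category] = shuffled[:n_train]
        test[category] = shuffled[n_train:]
    return train, test


# Hypothetical file names standing in for saved emails.
corpus = {
    "Requests": [f"req{i}.txt" for i in range(25)],
    "NBC": [f"nbc{i}.txt" for i in range(9)],
}
train, test = split_by_category(corpus)
print({c: (len(train[c]), len(test[c])) for c in corpus})
```

Splitting within each category keeps the category proportions of the training
and test sets close to those of the whole collection, which matters when some
categories have only a handful of emails.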

B.  EVALUATION  MEASURES 

The author uses the conventional measures of recall and precision to measure
categorization accuracy. Each test email is assigned to the category with the
highest returned value, and the measures are then computed per category as:

\[
\text{recall} = \frac{\text{emails correctly assigned to a category}}{\text{emails in a category}}
\]

\[
\text{precision} = \frac{\text{emails correctly assigned to a category}}{\text{emails assigned to a category}}
\]

Table 5. Evaluation Measures.
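
The two measures in Table 5 can be computed from parallel lists of true and
assigned categories, as in the Python sketch below. The function name and the
convention of returning None when a measure is undefined (no emails in the
category, or none assigned to it) are assumptions made for this example.

```python
def recall_precision(true_categories, assigned_categories, category):
    """Return (recall, precision) for one category; None where undefined."""
    pairs = list(zip(true_categories, assigned_categories))
    correct = sum(1 for t, a in pairs if t == category and a == category)
    in_category = sum(1 for t, _ in pairs if t == category)
    assigned = sum(1 for _, a in pairs if a == category)
    recall = correct / in_category if in_category else None
    precision = correct / assigned if assigned else None
    return recall, precision


# Four hypothetical test emails: true category versus categorizer output.
true_cats = ["Fiscal", "Fiscal", "Personal", "Personal"]
assigned_cats = ["Fiscal", "Personal", "Personal", "Personal"]
print(recall_precision(true_cats, assigned_cats, "Fiscal"))
print(recall_precision(true_cats, assigned_cats, "Personal"))
```

In this toy case Fiscal has perfect precision but misses half of its emails,
while Personal recovers all of its emails at the cost of one wrong assignment.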
The author also displays the results in the form of 19x19 confusion matrices
that show the true categories against the categories actually chosen by our
categorizer.

C.  METHODOLOGY 

Given an email document, an independent binary classifier for each category
produces a value, and the single category with the highest value is chosen. Two
methods of preprocessing text are used. First, a word "destemmer" algorithm is
used (Rowe, 1998). The algorithm removes suffixes from an English word to
regularize its forms. Sample behavior of Porter's stemming algorithm can be seen
in Table 6 below:


Word          Porter Stemming Algorithm
believes      believ
working       work
starting      start
playfully     play

Table 6. Porter Stemming Algorithm Behavior.

In addition to destemming words, a "stop word" removal list is used (Rowe,
1998). The stop-word list consists of 700 common words such as "a", "and",
"the", "of", etc., that are generally non-informative and can be removed to
improve categorization. Some HTML tags and special characters were also added to
the stop-word list to eliminate redundant, non-useful tokens such as BR, TR, and
other commonly used markup tags. Two of the most common words, "of" and "the",
account for 10% of word occurrences in most documents (Mooney, 2002). Sample
stop-word frequencies can be seen in Table 7.

Frequent word    Number of occurrences    Percentage of total
the              7,398,934                5.9
of               3,893,790                3.1
to               3,364,653                2.7
and              3,320,687                2.6
in               2,311,785                1.8
is               1,559,147                1.2
for              1,313,516                1.0
The              1,144,860                0.9
that             1,066,503                0.8
said             1,027,713                0.8

Table 7. Sample Word Frequency Data: Frequencies from 336,310 Documents in the
1GB TREC Volume 3 Corpus; 125,720,891 Total Word Occurrences; 508,209 Unique
Words.
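
The stop-word and markup filtering described in this section can be sketched in
Python as follows. The word lists here are tiny stand-ins for the 700-word list
used in the thesis, the markup tokens other than BR and TR are hypothetical, and
the function name and the regular-expression tokenizer are assumptions. The
destemming step is deliberately left out, and lowercasing is done first for
simplicity.

```python
import re

# Stand-in for the 700-word stop list (assumption: only a few entries shown).
STOP_WORDS = {"a", "and", "the", "of", "to", "in", "is", "for", "that"}
# BR and TR are named in the text; TD and HTML are hypothetical additions.
MARKUP_TOKENS = {"br", "tr", "td", "html"}


def preprocess(text):
    """Lowercase, tokenize on letters and digits, drop stop words and tags."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in MARKUP_TOKENS]


print(preprocess("The status of the truck<BR>is UP"))
```

Treating tag names as stop words is a blunt way to handle HTML: it removes the
most frequent markup tokens but leaves attribute values and entities in place.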
IV. DESCRIPTION OF TEXT CATEGORIZER PROGRAM


A.  PROGRAM  STRUCTURE 

The text categorizer was written in Java. The overall architecture of the
program can be seen in Figure 1. The program starts by accepting email documents
in text-document format. The training set is manually categorized, and then both
sets are tokenized. Each email document is run through a destemming program, and
any of the 700 stop words are removed. All capital letters are converted to
lower case. The training set is used to calculate clue probabilities. A word
that remains after stop-word removal must appear a minimum of 10 times to have
its probability calculated. Probabilities are calculated by examining two
subdirectories labeled "yes" and "no" and finding the conditional probability of
a "yes" given the occurrence of a particular word. For the test set, 19 weighted
sums are calculated for each document.
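
The clue-probability calculation can be sketched in Python as below. The ratio
yes / (yes + no) matches the sample figures reported later in this thesis (for
example, 1 "yes" and 254 "no" occurrences of "usmc" give 0.0039...), but the
function name, the data layout, and the reading of the 10-occurrence minimum as
a threshold on total occurrences across both directories are assumptions. This
is not the Java ClueWords program itself.

```python
from collections import Counter


def clue_probabilities(yes_docs, no_docs, min_count=10):
    """Estimate P(yes | word) for every word seen at least min_count times.

    yes_docs / no_docs: lists of word lists for emails inside and outside
    the category (the "yes" and "no" subdirectories).
    """
    yes_counts = Counter(word for doc in yes_docs for word in doc)
    no_counts = Counter(word for doc in no_docs for word in doc)
    probabilities = {}
    for word in set(yes_counts) | set(no_counts):
        total = yes_counts[word] + no_counts[word]
        if total >= min_count:  # rare words are too unreliable to use as clues
            probabilities[word] = yes_counts[word] / total
    return probabilities


# Counts chosen to mirror two lines of the sample output in Appendix A.
yes_docs = [["usmc"], ["write"] * 3]
no_docs = [["usmc"] * 254, ["write"] * 122, ["rare"] * 2]
probs = clue_probabilities(yes_docs, no_docs)
print(probs["usmc"], probs["write"])
```

Words that occur almost only inside the category get probabilities near 1 and
act as strong clues, while words spread across all categories get probabilities
near the category's overall share of the text.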

The  “ClueWords”  program  was  adapted  with  minor  changes  from  another 
program,  “GetClueProbs”  (Rowe,  1998).  The  program  was  modified  to  extract  stop 
words.  The  RateDocs  program  was  modified  to  check  for  the  “Subject”  line  of  an  email 
and  increment  the  overall  word  count.  The  “RateDocs”  program  takes  the  weighted  sum 
of  the  number  of  occurrences  of  each  clueword,  and  then  chooses  the  category  of  highest 
total  weight. 
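
A minimal Python sketch of this rating step follows. It assumes that the weight
of each clueword is its clue probability for the category and that words on the
"Subject" line are simply counted twice; both are assumptions made for
illustration, as are the function name and the sample probabilities, which are
taken from the example clueword tables in Chapter V. It is not the Java RateDocs
program.

```python
def rate_document(body_words, subject_words, clue_probs_by_category,
                  subject_weight=2):
    """Score a document against every category and return the best one.

    clue_probs_by_category maps category -> {clueword: probability}.
    Each occurrence of a clueword adds its probability to the category's
    score; occurrences on the Subject line count subject_weight times.
    """
    scores = {}
    for category, probs in clue_probs_by_category.items():
        score = sum(probs.get(word, 0.0) for word in body_words)
        score += subject_weight * sum(probs.get(word, 0.0)
                                      for word in subject_words)
        scores[category] = score
    best = max(scores, key=scores.get)
    return best, scores


clues = {
    "Fiscal": {"fiscal": 0.51, "money": 0.27},
    "Personal": {"love": 0.96, "her": 0.94},
}
best, scores = rate_document(["money", "fiscal", "money"], ["fiscal"], clues)
print(best, scores)
```

Because every category is scored independently and the maximum is taken, a
category with many high-probability cluewords can win even for documents that
belong elsewhere, which is one way low precision can arise.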

B.  PROGRAM  PERFORMANCE 

All  code  was  written  in  Java  using  JDK1.3.1_02  Java  Virtual  Machine  release. 
The  programs  were  executed  on  a  Pentium  II  Processor  Intel  MMX  chip  running 
Windows  98  with  160  MB  of  RAM.  Table  8  shows  processing  time. 
[Figure 1. Program architecture. The training path begins with manual
categorization, and the actions in its dotted-line box are performed by the
ClueProbs program. The test path begins with the email test set, and the actions
in its dotted-line box are performed by the RateDocs program.]
Train/Test    Number of documents    Real total time    Real time per document
                                     in seconds         in seconds
Training      585                    225                4.2
Test          152                    165                0.9

Table 8. Processing Time.
V. EXPERIMENT RESULTS


A.  INTRODUCTION 

After destemming and eliminating stop words, the training set consisted of 585
different email documents containing more than 30,000 words including
duplicates. Minimum word counts were set at 5, 10, and 15 words per document and
run through our clue-probability program. The number of clue words after
destemming, elimination of stop words, and conversion of upper-case letters to
lower case ranged from 470 to 1,866 over the different categories. The top five
keywords by conditional probability for three sample categories are shown below
in Tables 9-11. "Personal" is a very large category with many keywords and high
probabilities. "Fiscal" is a medium-sized category with some high-ranking clue
words and some important low-ranking ones. "Requests" is a small category with
low-ranked keywords.


Probability    # in category (yes)    # not in category (no)    Word
0.99           515                    5                         hotmail
0.98           97                     1                         resort
0.98           127                    2                         sooners90
0.96           64                     2                         love
0.94           149                    9                         her

Table 9. Example Cluewords in Category Personal.


Probability    # in category (yes)    # not in category (no)    Word
0.92           36                     3                         ABC*
0.51           23                     22                        fiscal
0.27           15                     39                        money
0.22           14                     47                        spend
0.21           15                     54                        finance

Table 10. Example Cluewords in Category Fiscal.


Probability    # in category (yes)    # not in category (no)    Word
0.53           15                     13                        mimm
0.50           8                      8                         laptop
0.27           5                      13                        gear
0.25           4                      12                        machine
0.12           34                     230                       request

Table 11. Example Cluewords in Category Requests.
Our classifiers were tested on 152 new, previously unseen email texts. Table 12
shows recall and precision for the test set. Table 13 shows a confusion matrix
of the 19 categories for the 152 test-set documents. It shows which categories
were "confused" with one another and which categories were clearly identified.


Category                Recall                Precision
Classes                 0.77                  0.18
Grades                  1.00                  0.05
Personal                0.18                  1.00
3270                    0.67                  0.07
BWD Mess                0.01                  0.00
Equipment               0.50                  0.03
Equipment Allowance     0.67                  0.05
Equipment Readiness     1.00                  0.12
Exercises               1.00                  0.06
Expeditor               0.50                  0.06
Fiscal                  0.80                  0.10
General                 1.00                  0.05
Maintenance             1.00                  0.07
Miscellaneous           0.80                  0.18
NBC                     0.50                  0.17
Records                 Not enough records    Not enough records
Requests                0.80                  0.11
Shipboard Billeting     0.80                  0.08
Info_Tech               0.32                  0.78
Average                 0.65                  0.17

Table 12. Results of Categorizing Test Set Documents, With Subject Line Code Included.
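
The averages in Table 12 can be checked with a few lines of Python. The values
below are copied from the table. The reported averages of 0.65 and 0.17 appear
to be consistent with dividing by all 19 categories, that is, with the Records
row contributing zero; this reading is an inference from the numbers, not
something the table states, and the function name is an assumption.

```python
# Per-category values from Table 12, in table order; None marks Records.
RECALL = [0.77, 1.00, 0.18, 0.67, 0.01, 0.50, 0.67, 1.00, 1.00, 0.50,
          0.80, 1.00, 1.00, 0.80, 0.50, None, 0.80, 0.80, 0.32]
PRECISION = [0.18, 0.05, 1.00, 0.07, 0.00, 0.03, 0.05, 0.12, 0.06, 0.06,
             0.10, 0.05, 0.07, 0.18, 0.17, None, 0.11, 0.08, 0.78]


def macro_average(values, count_missing_as_zero=True):
    """Unweighted mean over categories; missing rows count as zero or are skipped."""
    present = [v for v in values if v is not None]
    denominator = len(values) if count_missing_as_zero else len(present)
    return sum(present) / denominator


print(round(macro_average(RECALL), 2), round(macro_average(PRECISION), 2))
print(round(macro_average(RECALL, False), 2),
      round(macro_average(PRECISION, False), 2))
```

Because this is an unweighted average over categories, the many small
categories with low precision pull the figure down much more than they would in
an average weighted by the number of test emails.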
 #  Category         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
 1  Classes        113   8   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  31
 2  Grades          30 111   0   9   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0
 3  Personal         4   0 110   0   0   0   0   0   0   0   0   0   0   0  38   0   0   0   0
 4  3270             0   0   8  29   0  16   0   0   0   0   0   0  10  40   0   0  30   0  19
 5  BWD Mess         0   0  60   0  37   ?   0   0   0  55   0   0   0   0   0   -   -   -   -
 6  Equipment        4   0  29   0   0  21  15  15   0   0   0   0  18  20  29   0   0   0   0
 7  EqmntAll         0   0   0   0   0  20  40  21  11   0   0   0  11  30   0   0   0   0  20
 8  EqmntRead        0   2   0   0   0  30  29  42  20   0   7  14   0   0   8   0   0   0   0
 9  Exercises        0   0   0   0   0  20  25  15  38  10   0   0   0  30   0  14   0   0   0
10  Expeditor        0   0   0   0   6  15   ?   ?  30  18   0   1   0  45   0   0   0   0   0
11  Fiscal           0   0   0   0   0  16   0   0   0   0  71   0   0   0   0   0   -   -   -
12  General          0   0   0  50   0   7   0   0   0   0   ?  19   0  63   0   7   2   0   4
13  Maintenance      0   0   0   0   0  42   0   0  12   0   ?   0  28  52   0   0  18   0   0
14  Miscellaneous    1   0  12   0   0   0   0   0   0   0   ?  12   0 120   0   0   0   0   7
15  NBC              0   0  60   0   0   6   3   4   0   0   ?   0   0  70   9   0   0   0   0
16  Records *        0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
17  Requests         0   0  40   0   0  20   0  15   0   0  13  17   0   0   0   0  47   0   0
18  Sbrd Billeting   8   0  15   0   6   0   0   0  39   0   0  12   0   0   0   9   0  63   0
19  InfoTech        70   0  60   0   0   0   0   0   0   0   0   0   0  12   0   0   0   0  10

Table 13. 19x19 Confusion Matrix and Potential Category Clusters.

*  Not enough data to accurately categorize
?  Cell illegible in the source
-  Cell missing from the source



B.  DATA  ANALYSIS 

Results indicate that an average of 65% of documents were correctly classified
into their respective categories. Of the 19 categories, 15 had a probability of
50% or greater of being properly classified. The remaining 4 categories were not
properly classified for several reasons. Categories #3 (Personal) and #19
(Info_Tech) were frequently confused with one another and with category #14
(Miscellaneous). For categories #5 (BWD Mess) and #16 (Records) there were not
enough examples to train on. Category #19 (Info_Tech) was also often confused
with category #1 (Classes). Larger categories had better precision; smaller
categories demonstrated higher recall. Average recall rates were acceptable, but
precision rates were disappointing, which can be attributed to categories that
cluster together.

Due  to  the  unique  nature  of  the  corpus,  each  of  the  categories  seemed  to  have 
certain  cluewords  that  only  helped  it.  In  some  categories  stop  words  could  have  been 
good  discriminators,  such  as  category  #3  (Personal)  where  the  words  “can”  and  “do” 
frequently  show  up. 
Results with double weighting of the document "Subject" line show a minimal
increase in overall success probability. Table 14 shows the actual number of
words in each category's positive examples compared with the number of words in
its non-examples. The table highlights the low number of training examples for
smaller categories such as #5 (BWD Mess).


Category Name          Actual # of words in category    Actual # of words in category
                       (yes-examples)                   (no-examples)

Classes                         78,939                          562,272
Grades                          25,952                          617,491
Personal                       417,982                          239,196
3270                             1,634                          638,416
BWD Mess                           237                          639,813
Equipment                        2,126                          637,924
Equipment Allowance              1,945                          638,105
Equipment Readiness              1,901                          638,149
Exercises                          481                          639,569
Expeditor                          486                          639,564
Fiscal                           3,757                          636,293
General                            538                          639,512
Maintenance                        881                          639,169
Miscellaneous                    3,332                          636,718
NBC                                685                          639,365
Records                             65                          639,985
Requests                         3,233                          640,050
Shipboard Billeting              1,500                          641,783
Info_Tech                      115,961                          541,280

Table 14. Actual Number of Words Identified for Positive Examples Versus Non-Examples.
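
The imbalance in Table 14 can be made concrete with the overall probability that the ClueWords program (Appendix B) writes on the first line of its output, namely yescount / (yescount + nocount). The sketch below applies that formula to three rows of the table. The class and method names are illustrative assumptions.

```java
// Illustrative sketch: applies the overall-probability formula used in ClueWords
// to three rows of Table 14. Class and method names are hypothetical.
final class CategoryPrior {

    // Fraction of all counted words that come from the category's positive examples.
    static double prior(long yesWords, long noWords) {
        return (double) yesWords / (double) (yesWords + noWords);
    }

    public static void main(String[] args) {
        String[] names = {"Personal", "BWD Mess", "Records"};
        long[] yesWords = {417982L, 237L, 65L};       // yes-example word counts from Table 14
        long[] noWords = {239196L, 639813L, 639985L}; // no-example word counts from Table 14
        for (int k = 0; k < names.length; k++) {
            System.out.printf("%-10s %.6f%n", names[k], prior(yesWords[k], noWords[k]));
        }
    }
}
```

For the smallest categories the positive examples make up well under one percent of the counted words, which is consistent with the shortage of training examples noted above.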
VI. CONCLUSIONS AND RECOMMENDATIONS


A.  CONCLUSIONS 

This thesis examined automatic text categorization of email documents. The
primary method was the use of keywords and their conditional probabilities. Final
recall and precision results were 65% and 17% respectively. A stop-word list and a
destemmer program proved to be very helpful when dealing with text categorization.

B.  RECOMMENDATIONS 

This thesis could be extended by incorporating term phrases to improve
categorization. If possible, a more standardized corpus of text should be used, with
an approximately equal number of documents per category. A program that accurately
strips out all HTML markup and other special, non-text characters would be helpful.
Finally, a program or method to deal with personal names might prove very helpful in
some categories.
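
As one possible starting point for the HTML-stripping recommendation, the sketch below removes tags and character entities with regular expressions before the text is tokenized. It is a rough illustration, not part of the thesis software: the class name, method name, and patterns are assumptions, and a regular expression is not a full HTML parser (for example, the contents of script blocks would remain).

```java
import java.util.regex.Pattern;

// Illustrative sketch: a regex-based pre-filter for HTML residue in saved emails.
// All names and patterns are hypothetical and deliberately simple.
final class HtmlStripper {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");             // e.g. <font ...>
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;"); // e.g. &nbsp; or &#160;
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Replaces tags and entities with spaces, then collapses runs of whitespace.
    static String strip(String text) {
        String noTags = TAG.matcher(text).replaceAll(" ");
        String noEntities = ENTITY.matcher(noTags).replaceAll(" ");
        return SPACES.matcher(noEntities).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<font face=\"sans-serif\">Grades&nbsp;posted</font>"));
    }
}
```

A filter of this kind would keep markup tokens such as "sans-serif" and "http-equiv", both of which appear in the Appendix A sample, out of the clueword list.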
APPENDIX A. CLUEWORDS SAMPLE OUTPUT


0.03878220540933329 overall probability, 18146 yes examples, 449749 no examples,

0.0 0 12 shape
0.0 0 17 shall
0.024 3 122 write
0.0 0 72 friend
0.0 0 37 certificate
0.0 0 14 comfort
0.0 0 17 netscape
0.0 0 43 bwlogu
0.00392156862745098 1 254 usmc
0.0 0 14 considerate
0.05834683954619125 36 581 monterey
0.0 0 13 justin
0.06666666666666667 6 84 here
0.0 0 40 lejeun
0.0 0 34 hell
0.0 0 83 effect
0.0 0 15 comment
0.0 0 51 sans-serif
0.0125 1 79 hear
0.05782060785767235 78 1271 head
0.0 0 17 friday
0.0 0 17 urge
0.06481481481481481 7 101 strategy
0.06484641638225255 57 822 subject
0.17073170731707318 7 34 interact
0.0 0 188 Mike
0.0 0 18 your-account
0.0 0 11 extreme
0.0 0 29 prodigy
0.0 0 63 Fred
0.06097560975609756 40 616 http-equiv
0.0 0 38 script
0.17391304347826086 4 19 before
0.0 0 13 high-spee
0.1111111111111111 2 16 amador
0.0 0 62 Logue
0.0 0 12 accommodate
0.0 0 26 simply
0.0 0 26 upon
0.014492753623188406 1 68 federal
0.0 0 45 false
0.13333333333333333 2 13 adrian
0.0 0 17 hidden

** The list continues for up to 38 pages.
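
Each line of the sample above holds four fields: the conditional probability of "yes" given the word, the word's count in the "yes" examples, its count in the "no" examples, and the destemmed word itself. The probability is yes / (yes + no), as computed in Appendix B. The sketch below parses one such line and recomputes the probability as a check. The class and its member names are illustrative assumptions and are not part of the thesis programs.

```java
// Illustrative sketch: parses one clueprobs.out line such as "0.024 3 122 write".
// The class and member names are hypothetical.
final class ClueLine {
    final double prob;  // P(yes | word) as written in the file
    final long yes;     // occurrences of the word in "yes" examples
    final long no;      // occurrences of the word in "no" examples
    final String word;  // destemmed word

    ClueLine(double prob, long yes, long no, String word) {
        this.prob = prob;
        this.yes = yes;
        this.no = no;
        this.word = word;
    }

    // Splits on whitespace and expects exactly four fields.
    static ClueLine parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        return new ClueLine(Double.parseDouble(f[0]), Long.parseLong(f[1]),
                            Long.parseLong(f[2]), f[3]);
    }

    // Recomputes the probability from the two counts.
    double recomputed() {
        long total = yes + no;
        return total == 0 ? 0.0 : (double) yes / (double) total;
    }

    public static void main(String[] args) {
        ClueLine c = parse("0.05834683954619125 36 581 monterey");
        System.out.println(c.word + ": file=" + c.prob + " recomputed=" + c.recomputed());
    }
}
```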
APPENDIX B. CLUEWORDS PROGRAM


/**
 * Title: ClueWords
 *
 * Description: This class performs two functions. One is to take out
 * each non-stop word; the other is to look at the two subdirectories "yes"
 * and "no" and find the conditional probabilities of "yes"
 * given the occurrence of a particular word. Initial use is to
 * test against emails and try to categorize them appropriately.
 *
 * Elements adapted from Dr. Neil Rowe's programs GetClueProbs
 * and CountWords.
 *
 * Copyright: Copyright (c) 2002
 * Company: USMC NPS
 * @author Scott R. Hall
 * @version 1.0
 */

import java.io.*;
import java.util.*;

public class ClueWords
{
    public static void main (String args[]) throws IOException
    {
        // -----------------------------------------------------------
        // Data Member Declarations
        // -----------------------------------------------------------

        /*
         * Call Parser method. Used only for testing.
         */
        // Parser();

        // Mincount sets the minimum number of times that a word
        // must appear in order to have its probability calculated.
        double Mincount = 10; //was 10

        // An integer declaration
        int j;

        // Long integer data types for 4 items that allow better
        // granularity for calculating probabilities.
        long Oldcount, Count, wordyescount, wordnocount;

        // Integer declarations and assignment values.
        int yescount = 0;
        int nocount = 0;

        // Double real number declarations for 5 items that allow
        // Standard Deviation and Probabilities to be displayed properly.
        double yesratio, yesprob, Dev, Prob, SD;

        // String characters declared to include a string tokenizer
        // to help extract tokens from emails.
        String Inputline, Word, Stopword;
        StringTokenizer st;

        // File
        File Dir;

        // Declaring and creating HashSet. Implements Set using an
        // internal hashtable. Allows any type of object or null to
        // be a member of the set. There is no guarantee of order
        // for the set elements. There are no duplicates in a HashSet.
        HashSet rchs = new HashSet();

        // Within the Destemmer class call the hashKnownWords method
        // and pass it rchs.
        Destemmer.hashKnownWords(rchs);

        // Declaration and creation of hashmap. Same thing as a Hashtable
        // but methods are not synchronized.
        HashMap hm = new HashMap(200000);

        // Declarations below imported from CountWords program.
        HashSet hsstop = new HashSet(1000);
        FileReader fr1 = new FileReader("stopwords.txt");
        BufferedReader br1 = new BufferedReader(fr1);
        while ((Stopword = br1.readLine()) != null) hsstop.add(Stopword);

        // Creating a new instance of the File object and passing
        // it the name of the "yes" directory.
        Dir = new File("yes");
        String Filelist[] = Dir.list();

        // A "for" loop to go through the "yes/" directory and read in
        // files via buffered reader.
        for (j=0; j<Filelist.length; j++)
        {
            FileReader fr = new FileReader("yes/" + Filelist[j]);
            BufferedReader br = new BufferedReader(fr);

            // Inner "while" loop: while the buffered reader is not empty (null),
            // create a new StringTokenizer object and tokenize based on the
            // characters identified.
            while ((Inputline = br.readLine()) != null)
            {
                st = new StringTokenizer(Inputline," ,.;:'-*?!()[]{}_+=|\\\"<>/@#&*");

                // Another inner "while" loop that loops through each token while
                // there are more tokens left. If the token is not a number
                // string, then increment "yescount" and destem the word.
                while (st.hasMoreTokens())
                {
                    Word = st.nextToken();
                    if ((Word.length()>1) && (!numberString(Word)) &&
                        (!hsstop.contains(Word)))
                    {
                        yescount++;
                        Word = Destemmer.destem(Word,rchs);
                        if (!hm.containsKey(Word))
                        {
                            hm.put(Word,new Long(1000000));
                        }
                        else
                        {
                            Oldcount = ((Long)hm.get(Word)).longValue();
                            hm.put(Word,new Long(1000000+Oldcount));
                        }//end of last if statement
                    }//end of "yescount" if statement
                }//end of second while statement
            }//end of first while statement
            fr.close();//close out of file reader
        }//end of for statement

        // Same statements except for the "no" category.
        Dir = new File("no");
        String Filelist2[] = Dir.list();
        for (j=0; j<Filelist2.length; j++)
        {
            FileReader fr2 = new FileReader("no/" + Filelist2[j]);
            BufferedReader br2 = new BufferedReader(fr2);
            while ((Inputline = br2.readLine()) != null)
            {
                st = new StringTokenizer(Inputline," ,.;:'-*?!()[]{}_+=|\\\"<>/@#&*");
                while (st.hasMoreTokens())
                {
                    Word = st.nextToken();
                    if ((Word.length()>1) && (!numberString(Word)) &&
                        (!hsstop.contains(Word)))
                    {
                        nocount++;
                        Word = Destemmer.destem(Word,rchs);
                        if (!hm.containsKey(Word))
                        {
                            hm.put(Word,new Long(1));
                        }
                        else
                        {
                            Oldcount = ((Long)hm.get(Word)).longValue();
                            hm.put(Word,new Long(1+Oldcount));
                        }
                    }
                }
            }
            fr2.close();
        }

        PrintWriter fileout = new PrintWriter(new FileWriter("clueprobs.out"));
        if (nocount > 0) yesratio = (double)yescount/(double)nocount;
        else yesratio = 2.0*(double)yescount;
        yesprob = (double)yescount/(double)(yescount+nocount);
        fileout.println(yesprob + " overall probability, " + yescount + " yes examples, " + nocount
            + " no examples, ");

        Set set = hm.entrySet();
        Iterator i = set.iterator();
        while (i.hasNext())
        {
            Map.Entry me = (Map.Entry)i.next();
            Word = (String)me.getKey();
            Count = ((Long)me.getValue()).longValue();
            wordnocount = Count % 1000000;
            wordyescount = (Count-wordnocount)/1000000;
            Dev = (double)wordyescount-(yesratio*(double)wordnocount);
            Prob = (double)wordyescount/(double)(wordyescount+wordnocount);
            SD = Math.sqrt(1.0/((1.0/(double)wordyescount)+(1.0/(double)wordnocount)));
            if (((wordyescount+wordnocount)>Mincount) & (Math.abs(Dev) > SD))
                fileout.println(Prob + " " + wordyescount + " " + wordnocount
                    + " " + Word);
        }
        fileout.close();
    }

    /* Says whether a string of characters represents an integer or decimal */
    private static boolean numberString (String S)
    {
        boolean numberflag = false;
        int N = S.length();
        if (N > 0)
        {
            int i=0;
            if (S.charAt(0) == '-') i=1;
            char C;
            numberflag = true;
            while ((numberflag) & (i<N))
            {
                C = S.charAt(i);
                numberflag = (((C >= '0') & (C <= '9')) | (C == '.'));
                i++;
            }
        }
        return numberflag;
    }

    //For now I am remarking this call out
    /*
    public static void Parser ()
    {
        Parser parser = new Parser ("yessupplyfiscal.txt");
    }
    */
}
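
ClueWords keeps both counts for a word in a single long value: each "yes" occurrence adds 1,000,000 and each "no" occurrence adds 1, and the two counts are later separated with the remainder and division operations. The sketch below isolates that encoding. The class and method names are illustrative assumptions, and the scheme is only valid while a word's "no" count stays below 1,000,000.

```java
// Illustrative sketch of the packed-count scheme used in ClueWords.
// Class and method names are hypothetical.
final class PackedCounts {

    static final long YES_UNIT = 1000000L; // weight of one "yes" occurrence

    static long addYes(long packed) { return packed + YES_UNIT; }

    static long addNo(long packed) { return packed + 1L; }

    // The low six decimal digits hold the "no" count.
    static long noCount(long packed) { return packed % YES_UNIT; }

    // The remaining digits hold the "yes" count.
    static long yesCount(long packed) { return (packed - noCount(packed)) / YES_UNIT; }

    public static void main(String[] args) {
        long packed = 0L;
        for (int k = 0; k < 36; k++) packed = addYes(packed);
        for (int k = 0; k < 581; k++) packed = addNo(packed);
        System.out.println("yes=" + yesCount(packed) + " no=" + noCount(packed));
    }
}
```

Since the largest no-example total in Table 14 is 641,783 words, no single word in this corpus could reach the 1,000,000 limit.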
APPENDIX C. RATEDOCS PROGRAM


//  Given  a  directory  "unknown"  of  files  of  unknown  relevance,  rates 
//  each  document  for  the  appearance  of  clues  in  the  clueprobs.out  file. 

//  Author:  Neil  C.  Rowe,  9/01.  Modified  with  permission  by  Scott  R.  Hall 
import java.io.*;
import java.util.*;

class RateDocs
{
    public static void main (String args[]) throws IOException
    {
        int j, k1, k2, Wordcount, M;
        double yesratio, Dev, Prob, SD, Average, Total;
        Double DProb;
        String Inputline, Word, Probstring;
        StringTokenizer st;
        File Dir;
        HashSet rchs = new HashSet();
        Destemmer.hashKnownWords(rchs);
        HashMap hm = new HashMap(200000);
        FileReader fr;
        BufferedReader br;
        String tempString = new String();//temporary hold string object for subject line
        String subjectLine = new String();
        boolean foundSubj = false;//flag set to find subject line

        FileReader frprobs = new FileReader("clueprobs.out");
        BufferedReader brprobs = new BufferedReader(frprobs);
        Inputline = brprobs.readLine();
        // The first line written by ClueWords begins with the overall probability.
        k1 = Inputline.indexOf(' ');
        double Totalprob = Double.valueOf(Inputline.substring(0,k1)).doubleValue();
        while ((Inputline = brprobs.readLine()) != null)
        {
            k1 = Inputline.indexOf(' ');
            k2 = Inputline.lastIndexOf(' ');
            M = Inputline.length();
            Probstring = Inputline.substring(0,k1);
            Prob = (Double.valueOf(Probstring).doubleValue()) - Totalprob;
            Word = Inputline.substring(k2+1,Inputline.length());
            hm.put(Word, new Double(Prob));
        }
        frprobs.close();

        Dir = new File("unknown");// Begin unknown directory here
        String Filelist[] = Dir.list();
        for (j=0; j<Filelist.length; j++)
        {
            Wordcount = 0;
            Total = 0.0;
            fr = new FileReader("unknown/" + Filelist[j]);
            br = new BufferedReader(fr);
            while ((Inputline = br.readLine()) != null)
            {
                st = new StringTokenizer(Inputline," ,.;:'-*?!()[]{}_+=|\\\"<>/@#&*");
                foundSubj = false;// reset the flag for each line read

                //use "Subj:" for html
                if (!foundSubj && Inputline.indexOf("Subject:") >= 0)
                {
                    subjectLine = Inputline;
                    foundSubj = true;//change flag to true
                    Wordcount++;

                    PrintWriter fileout = new PrintWriter(new FileWriter("RATEDOCS.out"));
                    while (st.hasMoreTokens())
                    {
                        Word = st.nextToken();
                        if ((Word.length()>1) && (!numberString(Word)))
                        {
                            Word = Destemmer.destem(Word,rchs);
                            Wordcount++;
                            if (hm.containsKey(Word))
                            {
                                DProb = (Double)hm.get(Word);
                                System.out.println(DProb + " retrieved for " + Word);
                                Total = Total+(DProb.doubleValue());
                                fileout.println(DProb + " retrieved for " + Word);
                            }
                        }
                    }
                    fileout.close();
                }//outer if statement for subject line find
            }//while close
            fr.close();
            Average = Total/(double)Wordcount;
            System.out.println(Average + " strength for document " + Filelist[j]);
        }
    }

    /* Says whether a string of characters represents an integer or decimal */
    private static boolean numberString (String S)
    {
        boolean numberflag = false;
        int N = S.length();
        if (N > 0)
        {
            int i=0;
            if (S.charAt(0) == '-') i=1;
            char C;
            numberflag = true;
            while ((numberflag) & (i<N))
            {
                C = S.charAt(i);
                numberflag = (((C >= '0') & (C <= '9')) | (C == '.'));
                i++;
            }
        }
        return numberflag;
    }
}
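
The rating that RateDocs assigns to a document is an average: for every counted word that has a clue entry, the difference between that word's conditional probability and the overall probability is added to a total, and the total is divided by the number of counted words. The sketch below shows that calculation on in-memory data. The class name, method signature, and sample values are illustrative assumptions; the file handling, destemming, and number filtering of the real program are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the averaging step in RateDocs. All names are hypothetical.
final class DocumentRater {

    // Averages (P(yes | word) - overallProb) over all words; words without a
    // clue entry contribute nothing to the total but still count in the divisor.
    static double rate(List<String> words, Map<String, Double> clueProbs, double overallProb) {
        if (words.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (String w : words) {
            Double p = clueProbs.get(w);
            if (p != null) {
                total += p - overallProb;
            }
        }
        return total / words.size();
    }

    public static void main(String[] args) {
        Map<String, Double> clues = new HashMap<>();
        clues.put("grade", 0.75); // hypothetical clue probability
        List<String> doc = Arrays.asList("grade", "meeting");
        System.out.println("strength = " + rate(doc, clues, 0.25));
    }
}
```

A positive strength indicates that the document's words occur in the category's positive examples more often than the overall probability would predict.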
APPENDIX D. SAMPLE RUN FROM RATE DOCS


Test  Set  -  Rate  Docs  for  grades  run  against  clueprobs 

0.9326399520216222  retrieved  for  qpr 
0.9326399520216222  retrieved  for  qpr 
0.9326399520216222  retrieved  for  qpr 
0.9326399520216222  retrieved  for  qpr 
0.9326399520216222  retrieved  for  qpr 
0.9326399520216222  retrieved  for  qpr 
0.848555867937538  retrieved  for  nw3230 
0.6519746713563415  retrieved  for  logistic 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 
0.6172012256239916  retrieved  for  grade 

*  Document  continues  for  up  to  38  pages. 
APPENDIX E. PARSER PROGRAM


/**
 * Title: Parser
 * Description: (1) Reads a text file (emails saved as *.txt file)
 * (2) Finds Subject line and parses it to find keywords
 * (3) Reads entire file and counts the frequency of occurrence of key words
 * (4) Prints subject line keywords & freqs to screen
 *
 * Some elements adapted from Steve Simmon’s Parser program
 * @version 1.0
 */

import java.io.*;
import java.util.*;

public class Parser {

    //Class Variables (Global)
    String fileName = new String();
    StringBuffer filetext = new StringBuffer();
    String subjectLine = new String();
    String keyClueWord = new String();
    Vector subjKeyWords = new Vector();

    // Constructor
    public Parser(String fileinput) {
        //get the filename from the commandline argument
        fileName = fileinput;
        ReadFile();
        ParseSubject();
        ParseEmailText();
        //ReadClueProbs();
    } //end Constructor

    // Method: ReadFile
    void ReadFile() {
        String tempString = new String();
        boolean foundSubj = false;
        try {
            BufferedReader fileReader = new BufferedReader(new FileReader(fileName));
            while (fileReader.ready())
            {
                //Read each line of the email text file & store in string buffer
                tempString = fileReader.readLine();

                //make all lowercase
                tempString = tempString.toLowerCase();

                //Find subject line, change it from just "subj"
                if (!foundSubj && tempString.indexOf("subj") >= 0)
                {
                    subjectLine = tempString;
                    foundSubj = true;

                    //Debug print out
                    System.out.println("Subject line: " + subjectLine);
                }

                //add line read to String Buffer, goes to frequency count
                filetext.append(tempString);
            } //end While
        } //end try stmt
        //Opening a file via FileReader object can throw FileNotFound Exception
        catch (FileNotFoundException fileEX) {
        }
        //Reading text in from a file can throw an IOException
        catch (IOException IOEX) {
        }
    } //end method ReadFile

    // Method: ParseSubject
    void ParseSubject()
    {
        //create a String Tokenizer from the string that is the subject line
        //default tokenizing is to break string into words
        StringTokenizer subjectWords = new StringTokenizer(subjectLine);
        String tempString = new String();
        while (subjectWords.hasMoreTokens())
        {
            tempString = subjectWords.nextToken();

            //Debug print out
            System.out.println("Token: " + tempString);

            //check to see if word is a key word;
            //if the word is a keyword, add to vector
            if (KeyWord(tempString))
            {
                KeyWord temp = new KeyWord(tempString);
                subjKeyWords.add(temp);
            }
        } //end While
    } //end method ParseSubject

    // Method: KeyWord
    boolean KeyWord(String text) {
        boolean IsKeyWord = true;
        String smallWords[] = {"and", "the", "a", "an", "if", "it", "is", "this",
                               "subject", "subj", "re", " ", "FW:", "fwd:"};
        String tempString = new String();

        for (int i = 0; i < smallWords.length; i++)
        {
            tempString = smallWords[i];
            if (text.startsWith(smallWords[i]))
            {
                IsKeyWord = false;
                break;
            }
        } //end for loop
        return IsKeyWord;
    } //end method KeyWord

    // Method: ParseEmailText
    void ParseEmailText()
    {
        Iterator KeyWordITR = subjKeyWords.iterator();
        KeyWord tempKeyWord = new KeyWord("txt");//dummy variable
        int counter = 1;
        int keyWordFreq = 0;
        while (KeyWordITR.hasNext())
        {
            //Get keyword from vector keywords in subject
            tempKeyWord = (KeyWord) KeyWordITR.next();

            //reset Freq
            keyWordFreq = 0;

            //Get Freq for this word
            keyWordFreq = getFrequency(tempKeyWord.keyword, filetext.toString());
            tempKeyWord.frequency = keyWordFreq;
            if (keyWordFreq > 2)
            {
                //print out result to screen
                System.out.println("Subject Keyword " + counter++ + " " + tempKeyWord.keyword +
                    " Frequency: " + tempKeyWord.frequency);
            }
        } //end while stmt
    } //end Method ParseEmailText

    // Method: getFrequency
    int getFrequency(String keyWord, String file)
    {
        int count = 0;
        int index = -1;
        index = file.indexOf(keyWord);

        //1st occurrence of keyword found
        if (index >= 0)
        {
            //increment count and make recursive call to this function with remaining
            //text less all words up to and including the found keyword occurrence
            count = 1 + getFrequency(keyWord, file.substring(index + keyWord.length()));
        }
        return count;
    }

    // Method: main
    public static void main(String[] args)
    {
        Parser parser1 = new Parser("testparser.txt");//was args[0] or "cs4556_8.txt"
    }
}
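
The getFrequency method above counts non-overlapping occurrences of a keyword by calling itself on the remaining text, which creates a new substring and a new stack frame for every match. An iterative version that gives the same count can be sketched as follows. It is offered as an alternative under stated assumptions, not as the author's code: the class and method names are hypothetical, and an empty keyword is treated as occurring zero times.

```java
// Illustrative sketch: iterative equivalent of the recursive getFrequency.
// Class and method names are hypothetical.
final class FrequencyCounter {

    // Counts non-overlapping occurrences of keyWord in text, scanning left to right.
    static int countOccurrences(String keyWord, String text) {
        if (keyWord.isEmpty()) {
            return 0; // assumption: an empty keyword never matches
        }
        int count = 0;
        int index = text.indexOf(keyWord);
        while (index >= 0) {
            count++;
            // Continue the search just past the match that was found.
            index = text.indexOf(keyWord, index + keyWord.length());
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("grade", "grade grades upgrade"));
    }
}
```

Like the original method, this counts substring matches, so a keyword is also counted when it appears inside a longer word.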